DenseUNets with feedback non-local attention for the segmentation of specular microscopy images of the corneal endothelium with guttae

Corneal guttae, which are the abnormal growth of extracellular matrix in the corneal endothelium, are observed in specular images as black droplets that occlude the endothelial cells. To estimate the corneal parameters (endothelial cell density [ECD], coefficient of variation [CV], and hexagonality [HEX]), we propose a new deep learning method that includes a novel attention mechanism (named fNLA), which helps to infer the cell edges in the occluded areas. The approach first derives the cell edges, then infers the well-detected cells, and finally employs a postprocessing method to fix mistakes. This results in a binary segmentation from which the corneal parameters are estimated. We analyzed 1203 images (500 contained guttae) obtained with a Topcon SP-1P microscope. To generate the ground truth, we performed manual segmentation in all images. Several networks were evaluated (UNet, ResUNeXt, DenseUNets, UNet++, etc.) and we found that DenseUNets with fNLA provided the lowest error: a mean absolute error of 23.16 [cells/mm²] in ECD, 1.28 [%] in CV, and 3.13 [%] in HEX. Compared with Topcon’s built-in software, our error was 3–6 times smaller. Overall, our approach handled the cells affected by guttae notably well, detecting cell edges partially occluded by small guttae and discarding large areas covered by extensive guttae.

[Figure caption] Flowchart: CNN-Edge detects the cell edges from the specular image, CNN-Body infers the cell bodies that are well detected, and the postprocessing refines the edge images and applies the ROI from the body images to provide the final segmentation. The final segmentation (edges in red, vertices in yellow, non-ROI area in blue) is dilated and superimposed onto the specular image for illustrative purposes.

[...] independent cells, is also evaluated. Finally, these frameworks are compared against the manual annotations and the estimates provided by the microscope's software.

Results
Evaluation metrics. The metrics to evaluate the CNN output images were accuracy, the Sørensen-Dice coefficient (DICE), and modified Hausdorff distance (MHD) 38. MHD is a good metric to evaluate the edge images because it measures how close the detected edges are to the manual annotations, whereas DICE can better evaluate the performance for the body images since it measures the overlap of the region of interest (ROI). The evaluation metrics for the corneal parameters were mean absolute error (MAE) and mean absolute percentage error (MAPE). To assess the statistical significance between methods, we used the paired Wilcoxon test to compare the MAPE after assigning a 100% error if no parameter estimate was produced. To assess the clinical statistical significance of our method against the gold standard, we used the Kruskal-Wallis H test. All tests used a statistical significance established at α = 0.05.
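As an illustration of the image and parameter metrics named above, the following is a minimal NumPy sketch. It assumes the standard definitions (MHD as the maximum of the two directed mean nearest-neighbour distances, DICE on binarised images); the function names, the 0.5 binarisation threshold, and the handling of empty images are assumptions of this sketch and are not taken from the original implementation.

```python
import numpy as np


def dice(pred, target, threshold=0.5):
    """Sørensen-Dice coefficient after binarising both images (threshold is an assumption)."""
    p = np.asarray(pred) >= threshold
    t = np.asarray(target) >= threshold
    denom = p.sum() + t.sum()
    if denom == 0:
        return 1.0  # both images empty: treated as perfect overlap (assumption)
    return 2.0 * np.logical_and(p, t).sum() / denom


def modified_hausdorff(edges_a, edges_b):
    """Modified Hausdorff distance between two binary edge images, in pixels.

    Uses a full pairwise distance matrix, which is fine for small examples;
    a distance transform would be more economical for full-size images.
    """
    a = np.argwhere(np.asarray(edges_a) > 0).astype(float)
    b = np.argwhere(np.asarray(edges_b) > 0).astype(float)
    if len(a) == 0 or len(b) == 0:
        raise ValueError("both edge images must contain at least one edge pixel")
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Directed distances: mean distance from each point to its nearest neighbour.
    return max(d.min(axis=1).mean(), d.min(axis=0).mean())


def mae(estimates, references):
    """Mean absolute error."""
    e, r = np.asarray(estimates, float), np.asarray(references, float)
    return np.abs(e - r).mean()


def mape(estimates, references):
    """Mean absolute percentage error, in percent."""
    e, r = np.asarray(estimates, float), np.asarray(references, float)
    return 100.0 * (np.abs(e - r) / np.abs(r)).mean()
```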
Comparison between types of CNNs. Quantitatively (Table 1), the basic UNet, ResUNeXt, and DenseUNet seemed to provide similar performance, with slightly better results for DenseUNet (more noticeable if we look at the MAE of the corneal parameters). However, the differences between those networks were discernible in the qualitative results (Fig. 3): in the edge images, UNet and ResUNeXt provided rather binary results and were very conservative in the areas with large guttae (Fig. 3b,c, central area), whereas DenseUNet provided more probabilistic outputs and was able to extend the edge inference into those guttae areas, reaching beyond the manual annotations (Fig. 3d). In the case of the body images, DenseUNet showed a similar probabilistic behavior (Fig. 3), whereas ResUNeXt had significant problems and the lowest performance (Table 1). Thus, DenseUNet was the network to further investigate.

[Figure caption, partial] Red and green arrows indicate two peculiarities of the targets (to be discussed later).

One key point for the success of the edge inference within the guttae areas was the selective training procedure, where some of the images of the batch were sampled from specific guttae subsets. In fact, if the DenseUNets were trained without any stratified subsampling (random selection), such edge inferences did not appear (Fig. 3e). Another interesting observation was that the use of the multiscale networks (+ and ++) had no impact on the edge images but provided a clear improvement in the body images for all types of networks (Table 1), although this did not necessarily result in more accurate parameter estimations.
Selection of attention mechanism. In 2018, Oktay et al. 37 proposed an attention mechanism in UNets, which modifies the output of a transition block by using the tensors from the lower resolution stage. In our experiments, we tested several variations of this mechanism, but none provided a clear improvement. The best design (Table 1, most like Oktay's proposal) gave sharper edge images in complicated areas but it occasionally provided poor results for the body images (Fig. 3f).
In this work, we developed a different attention mechanism that would exploit the non-local dependencies between features of different resolution stages while simultaneously bringing the attention information from the lowest stages back to the highest ones. Quantitatively, the improvement was subtle but perceptible in the basic DenseUNet: the MHD in the edge images was substantially smaller in all cases, and all metrics in the body images improved (Table 1). Qualitatively, they all provided sharper edge images (there were fewer double-edge artifacts within guttae areas, Fig. 3), and the inference in the body images was more probabilistic (cells within guttae areas appeared with lower intensity instead of just pitch black). Thus, our attention mechanism seemed to correctly use the surrounding information to determine, in a probabilistic manner, whether a cell should be discarded. With regard to the type of aggregation at the end of the attention block, the concatenative and the multiplicative types provided similar results; in contrast, the additive type had the lowest performance in all the experiments. Therefore, we chose the multiplicative type because it uses fewer parameters and provides the overall smallest MAE in the corneal biomarkers (Table 1). On the other hand, the use of fNLA in the multiscale versions +/++ did not yield a clear benefit.
Postprocessing tuning. The postprocessing had two hyperparameters to tune: a threshold to discard weak edges ('edge threshold'), and a threshold to discard superpixels from the final segmentation ('body threshold'). We performed a combined grid search and found that, for the edge threshold: (i) a value around 0.1 minimized the error for CV regardless of the body threshold; (ii) for HEX, the optimal value was 0.1 for a body threshold of 0.5 or lower, but it shifted towards 0.2 as we increased the body threshold towards 1; and (iii) for ECD, a higher threshold (0.2-0.3) was better but the differences were very small. Overall, the purpose of the edge threshold is to simply discard false edges created during the postprocessing (watershed). In that respect, a value of 0.1 seemed to be a reasonable choice.
As for the body threshold, we observed that a value of 0.5 (which is the most intuitive choice) yielded the lowest errors in the three corneal parameters for the networks without attention, but it shifted towards 0.75 if we employed networks with fNLA, although the error differences were very small. However, visual inspection showed that no major mistakes were produced with a threshold of 0.5 (Fig. 4, cells with brighter green). We believe that a lower threshold sometimes includes cells that were not annotated in the gold standard; therefore, the differences do not indicate segmentation mistakes but mainly register the so-called 'dissimilarity due to cell variability' (i.e. the estimated parameters can vary if a different set of cells is used to estimate them).

Finally, the postprocessing computed the hexagonality by using the cell vertices (vertex method) instead of counting the neighboring cells (neighbor method, which inevitably discards the cells in the periphery). For those peripheral cells, it was possible to detect their vertices if a portion of the peripheral edges was visible. This was true for most cases, although sometimes the vertex detection was faulty (Fig. 4IV-e,f). Nevertheless, the MAE in our HEX estimations was 3.14 [%] with the vertex method and 4.13 [%] with the neighbor method. Thus, the vertex method was better even when there were some mistakes in the detection of peripheral vertices.
Comparison between frameworks. The performance of Topcon's algorithm was considerably inferior (Table 2), particularly for the images with guttae, where it failed to detect any cell in 30% of the images (as in Fig. 4c,f,g,h) and it only detected one third of the cells (on average) in the remaining images.
Regarding the previous approach CNN-ROI 28, the network had no problems accurately inferring the targets (Fig. 2j). However, the question was whether such targets are optimal to subsequently identify the well-detected cells. The error analysis indicated that the CNN-ROI method detected approximately 10-12 more cells per image, but there were a few segmentation mistakes among those cells, and the estimation errors were significantly worse than those of the CNN-Body approach (Table 2). The paired Wilcoxon test indicated a statistically significant difference between the estimates of approaches CNN-Body and CNN-ROI (P < 0.0001, all biomarkers).
As for the CNN-Blob, it detected the same number of cells as CNN-Body (Table 2) and the estimation errors were virtually the same (slightly worse for CNN-Blob in the images with guttae; Table 2). The paired Wilcoxon test indicated no statistically significant difference between the two estimates (P = 0.69, P = 0.13, and P = 0.69, for the ECD, CV, and HEX, respectively). Nevertheless, it is easier for a human observer to interpret what cells are well detected in the CNN-Body output (Fig. 2h) than in the CNN-Blob output (Fig. 2i).

[Fig. 4 caption, partial] (IV) the final segmentation (edges in red, vertices in yellow, non-ROI in blue, and the detected cells in two tonalities of green: brighter green for the cells whose body-intensity was between 0.50-0.75, and darker green for the cells whose body-intensity was between 0.75-1.00), and (V) the gold standard (the annotated edges in yellow; non-ROI in blue). On the right, the estimated parameters (gold standard values in parentheses). Topcon's software detected 45, 34, 0, 38, 6, 0, 0, and 0 cells (a-h).

Statistical analysis. The distributions of the estimated parameters from the CNN-Body method and the gold standard passed Levene's test for homogeneity, but they did not pass the Shapiro-Wilk normality test. Thus, the Kruskal-Wallis H test was performed and it showed no statistically significant difference between the manual and our automatic assessments for ECD (P = 0.81) and CV (P = 0.74), but it did for HEX (P < 0.001).
To further assess the estimates, we performed a Bland-Altman analysis, and it showed that more than 95% of the estimates were within the 95% limit of agreement for all parameters: 96.5% for ECD, 96.6% for CV, and 96.7% for HEX.
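For readers unfamiliar with the Bland-Altman analysis mentioned above, the sketch below computes the bias, the 95% limits of agreement (bias ± 1.96 SD of the differences), and the percentage of estimates falling within them. It assumes the conventional definition with the sample SD; the function name and return layout are illustrative, and the original analysis was performed in Matlab.

```python
import numpy as np


def bland_altman(estimates, references):
    """Return (bias, lower limit, upper limit, percentage inside the limits)."""
    e = np.asarray(estimates, dtype=float)
    r = np.asarray(references, dtype=float)
    diff = e - r
    bias = diff.mean()
    sd = diff.std(ddof=1)  # sample SD of the differences (assumption)
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    inside = 100.0 * np.mean((diff >= lower) & (diff <= upper))
    return bias, lower, upper, inside
```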
Error analysis. We plotted the error of the corneal parameters as a function of the number of cells (Fig. 5) and we fitted two exponentials to the mean and SD of the error using the least-squares method. The error estimates showed a normal distribution along the y-axis for the three parameters, which allowed us to assume that the area within two SDs covered approximately 95% of the error. Images with and without guttae were plotted with different colors (Fig. 5). This evaluation showed that, for images with a high number of cells, the error is very low regardless of the presence of guttae; in fact, guttae is a critical factor only for images with a very low number of cells. Overall, we can conclude that (i) the number of cells was the main variable to predict the reliability of the estimation, with a clear decrease in the error spread as more cells were detected; (ii) the unreliable cases were images with guttae and fewer than 25 cells; (iii) HEX required more cells to reduce the error spread, and (iv) there was a notable underestimation in HEX for the images with fewer than 100 cells.
Clinical analysis. The number of cells necessary to estimate ECD with high reliability is widely accepted as 75 cells 39. At that point, our estimated ECD error was approximately 0 ± 35 [cells/mm²] (mean ± SD). At 25 cells, it was 0 ± 45 [cells/mm²]. This is expected to be much lower than the uncertainty generated by the "cell variability". Doughty et al. 39 evaluated that uncertainty (in manual annotations) and concluded that estimating ECD using 75 cells entails an uncertainty (of 1 SD) of ±2% (or ±70 cells/mm², which is twice our method's error), whereas the uncertainty at 25 cells increases to ±10% (or ±350 cells/mm², more than seven times larger than our method's error). Nevertheless, both elements (uncertainty and error) should be taken into consideration when evaluating the reliability of the estimations. Regarding CV and HEX, there are no studies in the literature about the effect of the cell variability that would allow us to make a comparison.
We also observed that the number of visible cells and ECD decrease acutely as the amount of guttae increases (Fig. 6), whereas neither CV nor HEX showed a significant change. While it is generally accepted that a CV of less than 30% and a HEX of greater than 60% is usually a sign of a healthy endothelium 28, this assumption seems to be untrue for the current dataset. Thus, our results seem to suggest that guttae do not have an impact on either the cell size variation or the hexagonality of the visible cells (those that are not actually affected by extensive guttae).
Overall, our proposed method has a very low error spread for the three biomarkers in images graded with mild or moderate guttae (Fig. 6, up to grade 4). For the cases with severe guttae (grade 5-6), the main problem is the low number of detected cells (in grade 6, the average number of cells per image was 13 ± 10), which translates into a large estimation uncertainty, particularly for CV and HEX (Fig. 6). As shown in Fig. 4g,h, we did not observe major segmentation mistakes in those cases, but the low number of visible cells makes the biomarker estimation unreliable.

Table 2. The MAE of the endothelial parameters in the images with/without guttae, the percentage of images with estimates (success), and the average number of cells per image, for our methods and Topcon (the latter has two success percentages: one for ECD/CV and another for HEX). The manual annotations had an average of 148 and 166 cells for the images with/without guttae, respectively.

Discussion
We have presented a fully automatic method for the estimation of the corneal endothelium parameters from specular microscopy images that contain guttae. This is the first time that such images have been solved properly in the literature.
The main factor to achieve such accuracy was the way annotations were performed. We observed that, if the edges hidden by the guttae could be manually delineated, the networks could learn to replicate such pattern. Since endothelial cell edges usually appear as straight lines, the extension of partially occluded edges is sufficient to infer the hidden tessellation if the guttae are small enough. In fact, we observed that our model went beyond what our annotator could do. For example, the guttae indicated by the green arrow in Fig. 2 was too large to be certain about the hidden edges, yet the network provided a segmentation that seemed highly probable. However, further experiments would be needed to assert the trustworthiness of such inference; as a suggestion, the artificial generation of black spots in the images (mimicking the guttae's visuals) could be a way to evaluate such hypothesis, although that would raise questions regarding the experiment's reliability.
Our proposed attention mechanism (fNLA) moderately improved the performance in both types of CNNs (CNN-Edge and CNN-Body). Since the basic DenseUNet already achieved an excellent performance, the improvement provided by our attention block was modest. Nevertheless, we believe that the feedback-attention path depicted in our proposed network could yield a good performance in other types of segmentation problems, particularly the ones that infer areas instead of edges.
One interesting behavior of this framework is that, while CNN-Edge provides good inference beyond the manual annotations, CNN-Body is rather conservative. This is because the targets of CNN-Body were based on the original annotations and, thus, any inference by CNN-Edge that surpasses the manual annotations is not considered in the target of CNN-Body. While this might seem a negative quality, the results show that our framework detects practically the same number of cells as the manual annotations and barely any segmentation mistake is observed even in the most difficult cases (Fig. 4). Therefore, we believe this approach is preferable to more daring ones, like CNN-ROI (Table 2).
Overall, our estimates agreed very well with the gold standard and they were significantly better than the ones provided by the instrument's software, demonstrating the ability of this artificial intelligence framework to accurately estimate the endothelial parameters from images with guttae.

Materials and methods
Datasets. Two datasets were employed in this work. The first dataset came from a clinical study concerning the implantation of a Baerveldt glaucoma implant in the Rotterdam Eye Hospital (Rotterdam, the Netherlands). The clinical study contained 7975 images from 204 patients (average age 66±10 years), who were imaged with a specular microscope (Topcon SP-1P, Topcon Co., Tokyo, Japan) before the device implantation and at 3, 6, 12, and 24 months after, in both the central (CE) and the peripheral supratemporal (ST) cornea. The protocol required five specular images to be taken in each area for each visit, although it was sometimes difficult to reach that number of gradable images (specifically, an average of 4.7 CE images and 3.6 ST images per visit were acquired). Retrospectively, we observed that 15 patients had clear signs of FED stage two: they all had guttae in both CE and ST cornea, with a larger amount in CE and a clear increase in ST during the two-year follow-up. From this subset of patients, 193 images were collected. Furthermore, we observed that another 81 patients had small amounts of non-confluent guttae in the CE cornea (either FED stage one or due to normal aging), and 227 images were collected from these cases. In total, 420 images with presence of guttae were selected from this study. In addition, 400 images from other patients in the study without signs of guttae were selected to build a balanced dataset.
The second dataset came from another clinical study in the Rotterdam Eye Hospital regarding the transplantation of the cornea (ultrathin Descemet Stripping Automated Endothelial Keratoplasty, UT-DSAEK). This dataset contained 383 images of the central cornea from 41 eyes (41 patients, average age 73±7 years) and they were acquired at 1, 3, 6, and 12 months after surgery with the same specular microscope Topcon SP-1P. The included population for the study were patients over 18 years old with FED planned for keratoplasty. Among these patients, FED reappeared in one of them. Another 13 patients showed a small amount of non-confluent, stable guttae during the one-year follow-up. In total, 80 images out of the 383 showed some guttae (all images were included in the present work).

Altogether, the combined dataset contained 1203 images, of which 500 depicted guttae in various magnitudes. The images covered an area of approximately 0.25 mm × 0.55 mm and were saved as 8-bit grayscale images of 240 × 528 pixels. All images were manually segmented to create the gold standard, using the open-source image manipulation program GIMP (version 2.10). Furthermore, we collected the endothelial parameters provided by the microscope (Topcon SP-1P performed this with the software IMAGEnet i-base, version 1.32).

Ethics approval and consent to participate
Data was collected in accordance with the principles of the Declaration of Helsinki (October, 2013). Signed informed consent was obtained from all subjects. Participants gave consent to publish the data. Approval was obtained from the Medical Ethical Committee of the Erasmus Medical Center, Rotterdam, The Netherlands (MEC-2014-573). Trial registration for the Baerveldt study: NTR4946, registered 06/01/2015, URL https://www.trialregister.nl/trial/4823. Trial registration for the UT-DSAEK study: NTR4945, registered on 15-12-2014, URL https://www.trialregister.nl/trial/4805.
Grading the dataset. The 500 images with guttae were graded based on how complex they were to segment, taking two metrics into account: the amount of guttae and blur present in the image. For both metrics, images were given a value of 1 (mild), 2 (moderate), or 3 (severe), the final grade being the sum of both values. As a result, there were 134 images with low complexity (grades 1-2), 235 images with medium complexity (grades 3-4), and 131 images with high complexity (grades 5-6).
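The grading rule above can be written down in a few lines. The sketch below is only illustrative: the function names and the group labels ("low", "medium", "high") are hypothetical, and the bins follow the grade ranges as stated in the text. Note that, with both scores restricted to 1-3, the sum produced by this sketch ranges from 2 to 6.

```python
def complexity_grade(guttae_level, blur_level):
    """Final grade as the sum of the guttae and blur scores (each 1, 2, or 3)."""
    for name, value in (("guttae_level", guttae_level), ("blur_level", blur_level)):
        if value not in (1, 2, 3):
            raise ValueError(f"{name} must be 1 (mild), 2 (moderate), or 3 (severe)")
    return guttae_level + blur_level


def complexity_group(grade):
    """Map a grade to the complexity group used for reporting and batch sampling."""
    if grade <= 2:
        return "low"
    if grade <= 4:
        return "medium"
    return "high"
```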
Targets and frameworks. CNN-Edge is the core of the method. If the specular image had good quality (high contrast) and cells were visible in the whole image, the resulting edge image would probably be so well inferred that a simple thresholding and skeletonization would suffice to obtain the binary segmentation. However, these issues (low contrast, blurred areas, and guttae) are present in the current images. In contrast, CNN-Body has the goal of providing a ROI image to discard areas masked by extensive guttae or blurriness.
These CNNs are trained independently and they all have the same design; thus, they are simply trained with different inputs and targets. To create the targets, we make use of the manual annotations (i.e. gold standard), which are binary images where value 1 indicates a cell edge (edges are 8-connected-pixel lines of 1 pixel width), value 0 represents a full cell body, and any area to discard (including partial cells) is given a value 0.5. If a blurred or guttae area is so small that the cell edges could be inferred by observing the surroundings, the edges are annotated instead (Fig. 2f). For all the annotated cells, we identify their vertices; this allows computing the parameter hexagonality from all cells and not only the inner cells (in the latter, HEX is computed by counting the neighboring cells, thus the cells in the periphery of the segmentation are not considered; this way of computing HEX was used in previous publications 23-25,28,40 and it is how Topcon's built-in software computes it). Therefore, HEX is now defined as the percentage of cells that have six vertices.
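To make the vertex-based definition of HEX concrete, the following sketch estimates the three corneal parameters from a list of cell areas and the number of vertices per cell. Only the HEX definition comes from the text; the ECD and CV formulas are the conventional ones (ECD as the inverse of the mean cell area, CV as the SD of the cell area over its mean) and are assumptions here, as are the function name and the use of the population SD.

```python
import numpy as np


def corneal_parameters(cell_areas_mm2, vertices_per_cell):
    """Return (ECD in cells/mm^2, CV in %, HEX in %) for one segmented image."""
    areas = np.asarray(cell_areas_mm2, dtype=float)
    verts = np.asarray(vertices_per_cell)
    if areas.size == 0 or areas.size != verts.size:
        raise ValueError("need one area and one vertex count per detected cell")
    ecd = 1.0 / areas.mean()                     # conventional definition (assumption)
    cv = 100.0 * areas.std() / areas.mean()      # population SD (assumption)
    hexagonality = 100.0 * np.mean(verts == 6)   # percentage of cells with six vertices
    return ecd, cv, hexagonality
```

Because the vertex count is available for every detected cell, peripheral cells contribute to HEX in this formulation, which is the point of the vertex method described above.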
The target of the CNN-Edge only contains the cell edges from the gold standard images, which have been convolved with a 7 × 7 isotropic unnormalized Gaussian filter of standard deviation (SD) 1 (Fig. 2b). This provides a continuous target with thicker edges, which proved to deliver better results than binary targets 23 .
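A possible way to build such a continuous edge target is sketched below with plain NumPy. The kernel follows the description (7 × 7, isotropic, SD of 1, unnormalized so that its peak is 1); clipping the result to [0, 1] is an assumption of this sketch, since overlapping kernel responses along a line exceed 1, and the function names are illustrative.

```python
import numpy as np


def gaussian_kernel(size=7, sd=1.0):
    """Isotropic Gaussian kernel with peak value 1 (not normalised to sum 1)."""
    r = np.arange(size) - size // 2
    xx, yy = np.meshgrid(r, r)
    return np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sd ** 2))


def soft_edge_target(binary_edges, size=7, sd=1.0, clip=True):
    """Convolve a binary edge map with the unnormalised Gaussian kernel."""
    edges = np.asarray(binary_edges, dtype=float)
    kernel = gaussian_kernel(size, sd)
    half = size // 2
    padded = np.pad(edges, half, mode="constant")
    out = np.zeros_like(edges)
    h, w = edges.shape
    # Direct sum over kernel offsets; the kernel is symmetric, so correlation
    # and convolution coincide.
    for dy in range(size):
        for dx in range(size):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    # Clipping to [0, 1] is an assumption, not stated in the text.
    return np.clip(out, 0.0, 1.0) if clip else out
```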
The target of the CNN-Body only contains the full cell bodies from the gold standard images, and partial cells are discarded either because they are partially occluded by large guttae or they are at the border of the image (Fig. 2c). The same probabilistic transformation is applied here. Alternatively, we also evaluated whether a target that also includes the edges between the cell bodies (Fig. 2d) was a better approach (named CNN-Blob).
This framework is similar to our previous approach 28 , where a model named CNN-ROI, whose input is the edge image, provides a binary map indicating the ROI (Fig. 2j). To create its target, the annotator would simply draw the rough area from the edge images (Fig. 2g) that they would choose as trustworthy, creating a binary target (Fig. 2e).
Backbone of the network. The proposed CNN has five resolution stages (Fig. 7a). We tested three designs depending on the connections of the convolutional layers within each node: consecutively (as in UNet 33), with residual connections (ResUNeXt 34), and with dense connections (DenseUNet 35). In addition, we also explored two multiscale designs (commonly referred to as + [plus] and ++ [plusplus] 36) in the aforementioned networks: UNet+ (Fig. 7d), UNet++, ResUNeXt+, ResUNeXt++, DenseUNet+, and DenseUNet++. Our ++ designs differ from the + ones in that the former use feature addition from all previous transition blocks of the same resolution stage. In total, nine basic networks were tested (the code for all cases can be found in our GitHub).
The attention mechanism. The core of the attention block (Fig. 7f) [...] other differences) Q, K, and V in Fig. 7f are the same tensor. In our case, the attention block is added at the end of each dense block and it is named feedback non-local attention (fNLA, Fig. 7f) or self-non-local attention (sNLA), depending on where it is placed within the network (Fig. 7c): if there exists a tensor from a lower resolution stage, the attention mechanism makes use of it (fNLA), but in the absence of such a lower tensor, a self-attention operation is performed (sNLA, where φ and g in Fig. 7f become a 1 × 1 convolution with the same input X_{a,b}). In the DenseUNet, the nodes of the encoder use fNLA (except the deepest node), and the nodes of the decoder use sNLA (Fig. 7b); in the DenseUNet+/++, only the largest decoder uses sNLA (Fig. 7e).
We also explored three types of attention mechanism depending on the type of aggregation at the end of the block: (a) the default case (Fig. 7f) uses multiplicative aggregation, where a single attention map (from ) is sigmoid activated and then element-wise multiplied to all input feature maps; (b) in the case of concatenative aggregation, is simply an ELU activation, whose output maps (C/8 maps) are concatenated to the input features; (c) in case of additive aggregation, becomes a 1 × 1 convolution with ReLU activation and C output feature maps, which are then summed to the input tensor. For comparison, Wang et al.'s model 44 is an sNLA with additive aggregation.
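The following NumPy sketch illustrates the general idea of a self-non-local attention block with multiplicative aggregation, as a rough aid to intuition rather than a reproduction of the published block. Several points are assumptions: the softmax normalisation of the query-key product, the absence of biases, the number of inner channels, and all weight names (w_theta, w_phi, w_g, w_out are hypothetical). A 1 × 1 convolution is written as a per-pixel matrix product.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the maximum for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def snla_multiplicative(x, w_theta, w_phi, w_g, w_out):
    """Self-non-local attention with multiplicative aggregation (sketch).

    x: feature maps of shape (H, W, C).
    w_theta, w_phi, w_g: (C, C_inner) weights of the 1x1 convolutions
        producing query, key, and value from the same input x.
    w_out: (C_inner, 1) weights producing a single attention map.
    """
    h, w, c = x.shape
    flat = x.reshape(h * w, c)           # one row per spatial position
    q = flat @ w_theta                   # query
    k = flat @ w_phi                     # key
    v = flat @ w_g                       # value
    attn = softmax(q @ k.T, axis=-1)     # (N, N) weights over all positions
    y = attn @ v                         # weighted sum of values at all positions
    gate = sigmoid(y @ w_out)            # single sigmoid-activated attention map
    return (flat * gate).reshape(h, w, c)  # same map multiplies every feature map
```

In the feedback variant described in the text, the key and value would instead be derived from the tensor of the lower resolution stage (after a 2 × 2 transpose convolution); that path is not reproduced here.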
Intuitively, an sNLA mechanism computes the response at a specific point in the feature maps as the weighted sum of the features at all positions. In fNLA, the attention block maps the input tensor against the output features from the lower dense block, thus allowing the attention mechanism to use information created further ahead in the network. Moreover, the feedback path created in the encoder allows the attention features to be propagated from the deepest dense block back to the first block. While endothelial specular images do not possess such long-range dependencies (features separated by a distance of 3-4 cells do not seem to be correlated), this attention operation might be useful in the presence of large blurred areas (such as guttae). In this work, we explored the use of this attention mechanism for DenseUNet and the multiscale versions DenseUNet+/++.

Description of the postprocessing. The postprocessing aims to fix any edge discontinuity. Here, we have improved a process that was first described in a previous publication 23. The steps are:
(I) We estimate the average cell size (l) by using Fourier analysis in the edge image 23. As we proved in previous work 25, this estimation is simple, extremely robust, and accurate.
(II) As a new step, we add a perimeter to the edge image with intensity 0.5. This closes any partial cell in touch with the border.
(III) We smooth the edge image with a Gaussian filter whose standard deviation is σ = k_σ · l, with k_σ = 0.2 (this parameter was derived in a previous publication 40). This fixes any edge discontinuity.
(IV) We apply classic watershed 47 to the smoothed edge image, which detects weak edges. This provides a binary segmentation where edges are 8-connected-pixel lines of 1 pixel width.
(V) In another new step, we identify every edge and vertex in the segmentation. The vertices are the branch points of the segmentation, and the edges are the set of 8-connected positive pixels whose endpoints are constrained to vertices 17. We set 2 pixels as the minimum length for an edge (edges of only 1 pixel are fused with the vertices at its endpoints to become a single vertex of 3 pixels). For every edge, we check its mean intensity (from the edge images) to discard the weak ones: if it is lower than 0.1 (threshold evaluated in the "Results" section), the edge is discarded. However, we make a distinction when removing edges: if the edge is internal, all pixels of the edge are completely removed and, thus, the vertex-pixels at the end of that edge become edge-pixels; in contrast, if the edge is in contact with the non-ROI area, we only remove a few pixels in the middle of the edge in order to preserve the vertices. This is relevant because we use the vertices to determine the HEX.
(VI) In the updated binary segmentation, every superpixel in contact with the border of the image is discarded. For the remaining superpixels, we check the Body/ROI image to determine whether to keep or discard them. If using the body/blob image, a superpixel is included if the average intensity of its pixels is above 0.5. If using the ROI image, a superpixel is included if at least 85% of its area is within the ROI.
(VII) In the final segmentation, the parameters are estimated.

[Fig. 7f caption, partial] [...] shown as the shape of their tensors, X_{a,b} is the tensor to be transformed, Y_{a+1,b} is the tensor from the lower resolution scale used for attention, the blue boxes (φ and g) indicate a 2 × 2 transpose convolution with strides 2, the red boxes (θ and [...]) indicate a 1 × 1 convolution, ⊕ denotes matrix multiplication, and ⊙ denotes elementwise multiplication. In the case of sNLA, φ and g are also a 1 × 1 convolution with X_{a,b} as input. In attention terminology 44, Q is for query, K for key, V for value.
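The numeric decision rules of postprocessing steps (III), (V), and (VI) can be sketched as small helper functions. This is a simplified illustration, not the original implementation: the function names and argument layouts are hypothetical, the superpixels are assumed to be given as an integer label map, and the removal of superpixels touching the image border is assumed to have been done beforehand.

```python
import numpy as np


def smoothing_sigma(cell_size, k_sigma=0.2):
    """Step (III): SD of the Gaussian filter as a fraction of the average cell size l."""
    return k_sigma * cell_size


def keep_edge(edge_image, edge_pixels, edge_threshold=0.1):
    """Step (V): an edge is discarded if its mean intensity is below the threshold."""
    rows = np.array([p[0] for p in edge_pixels])
    cols = np.array([p[1] for p in edge_pixels])
    return edge_image[rows, cols].mean() >= edge_threshold


def keep_superpixel(labels, label_id, body_image=None, roi_image=None,
                    body_threshold=0.5, roi_fraction=0.85):
    """Step (VI): inclusion rule for one superpixel.

    With a body/blob image, the mean intensity must exceed body_threshold;
    with a binary ROI image, at least roi_fraction of the area must lie in the ROI.
    """
    mask = labels == label_id
    if not mask.any():
        raise ValueError("label not present in the label map")
    if body_image is not None:
        return body_image[mask].mean() > body_threshold
    if roi_image is not None:
        return roi_image[mask].mean() >= roi_fraction
    raise ValueError("either body_image or roi_image must be provided")
```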

Implementation. To evaluate the networks, a ten-fold cross-validation was performed (with all images from one eye in the same fold). All networks were implemented in Tensorflow 2.4.1 on a single NVIDIA V100 GPU with 32GB of memory. A training batch size of 15 images was employed, where six images were sampled from a specific guttae subgroup (two images from each complexity level). The dimensions of the DenseUNet were selected such that it could fit within the GPU memory; to build the multiscale DenseUNet+/++, we reduced the GR to 4 so that they could still fit in memory while having a similar number of parameters. In that respect, ResUNeXt+/++ and UNet+/++ had one fewer convolutional block in each resolution stage than the ResUNeXt and UNet (except the lowest stage). The network hyperparameters were categorical cross-entropy as loss function, Nadam optimizer 48, initial learning rate (lr) of 0.001, no early stopping, 200 epochs for CNN-Edge (learning decay of 0.99, considering that the learning rate at epoch i is lr_i = lr · lr_decay^i) and 100 epochs for CNN-Body, CNN-Blob, and CNN-ROI (learning decay of 0.97). For data augmentation, images were randomly flipped left-right and up-down, and elastic deformations were employed. The specular images were only normalized to have a range between 0-1 instead of 0-255. The networks were programmed in Python 3.7, and the parameter estimation and statistical analyses were done in Matlab 2020a (MathWorks, Natick, MA).
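Two of the training details above lend themselves to a short sketch: the exponential learning-rate decay and the stratified composition of each batch. The code below uses only the Python standard library. It assumes that the nine images not drawn from the guttae subgroups are sampled at random from the whole training set without duplicates; this, the function names, and the dictionary keys are assumptions of the sketch, not details confirmed by the text.

```python
import random


def learning_rate(epoch, initial_lr=0.001, decay=0.99):
    """Learning rate at a given epoch: initial_lr * decay ** epoch."""
    return initial_lr * decay ** epoch


def sample_batch(all_ids, guttae_ids_by_level, batch_size=15, per_level=2, rng=None):
    """Draw one training batch with a fixed number of guttae images per complexity level.

    all_ids: identifiers of every training image.
    guttae_ids_by_level: dict with keys "low", "medium", "high" (hypothetical labels)
        mapping to the identifiers of the guttae images of that complexity.
    """
    rng = rng or random.Random()
    batch = []
    for level in ("low", "medium", "high"):
        batch.extend(rng.sample(guttae_ids_by_level[level], per_level))
    chosen = set(batch)
    remaining = [i for i in all_ids if i not in chosen]
    # Fill the rest of the batch at random from the whole set (assumption).
    batch.extend(rng.sample(remaining, batch_size - len(batch)))
    rng.shuffle(batch)
    return batch
```

With the defaults, each batch of 15 contains at least six guttae images (two per level), which mirrors the stratified subsampling that the Results section reports as important for edge inference within guttae areas.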

Data availability
The datasets used during the current study are available from the corresponding author on reasonable request.

Code availability
Code for the networks and their weights are available in our GitHub repository at https://github.com/jpviguerasguillen/feedback-non-local-attention-fNLA.